Genome Announcements

Journals.ASM.org 

Complete Genome Sequence of Coprothermobacter proteolyticus DSM 5265

Alexandra Alexiev,^a David A. Coil,^a Jonathan H. Badger,^b Julie Enticknap,^c Naomi Ward,^d Frank T. Robb,^e Jonathan A. Eisen^a,f

University of California Davis Genome Center, Davis, California, USA^a; J. Craig Venter Institute, La Jolla, California, USA^b; Cumnor House School, Danehill, West Sussex,
United Kingdom^c; University of Wyoming, Laramie, Wyoming, USA^d; University of Maryland School of Medicine, Baltimore, Maryland, USA^e; University of California Davis,
Department of Evolution and Ecology, Department of Medical Microbiology and Immunology, Davis, California, USA^f

Here we present the complete 1,424,912-bp genome sequence of Coprothermobacter proteolyticus DSM 5265, isolated from a
thermophilic digester fermenting tannery wastes and cattle manure. 
Coprothermobacter proteolyticus is a nonmotile, non-spore-forming,
rod-shaped, Gram-negative anaerobic bacterium
isolated from a thermophilic consortium fermenting tannery 
wastes and cattle manure (1). C. proteolyticus shows increased
utilization of fructose, mannose, glucose, maltose, and sucrose when
yeast extract is added together with either rumen fluid or Trypticase
peptone, compared with growth without these additives (1).
It was first considered a member of the genus Thermobacteroides 
but was later reclassified as Coprothermobacter proteolyticus (2).
C. proteolyticus was selected in 2002 as part of a National Science 
Foundation-funded "Assembling the Tree of Life" project at the 
Institute for Genomic Research (TIGR) to sequence the genomes 
of representatives of the seven phyla of bacteria that at the time 
had cultured representatives but no available genome sequence. 

C. proteolyticus DSM 5265 was grown in DSM medium 481, 
and DNA was extracted using standard techniques. Sanger se- 
quencing and genome assembly were performed as previously 
described for genomes sequenced by TIGR (3-5). Small and 
large insert plasmid libraries were constructed in pUC-derived 
vectors after random mechanical shearing (nebulization) of 
genomic DNA. 

Sequencing resulted in 14,614 reads with an average read 
length of 1,039 bp and a coverage estimate of 10×. Sequences were
assembled using Celera Assembler (6). The coverage criteria were 
that every position required at least double-clone coverage (or 
sequence from a PCR product amplified from genomic DNA) and
either sequence from both strands or two different sequencing 
chemistries. The sequence was edited manually, and additional 
PCR and sequencing reactions were done to close gaps, improve
coverage, and resolve sequence ambiguities (7). All repeated DNA
regions were verified by PCR amplification across the repeat and
sequencing of the product. The full assembly consists of 1,424,912
bases and has a G+C content of 44.8%.
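
As a quick sanity check on the figures above, fold coverage can be approximated as the total number of sequenced bases divided by the genome length. The sketch below uses only the read count, mean read length, and assembly size reported here; the function names are hypothetical, and the raw figure ignores read trimming and discarded reads, so it is expected to sit slightly above the reported estimate of about 10×. A small helper for G+C content is included for completeness and is shown on a toy string, not on the actual genome.

```python
def fold_coverage(num_reads: int, mean_read_len: float, genome_len: int) -> float:
    """Approximate fold coverage: total sequenced bases / genome length."""
    return num_reads * mean_read_len / genome_len


def gc_content(seq: str) -> float:
    """Fraction of G and C among unambiguous A/C/G/T bases (0.0 for empty input)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


# Values reported in the text: 14,614 reads, 1,039-bp mean length, 1,424,912-bp genome.
coverage = fold_coverage(14_614, 1_039, 1_424_912)
print(f"raw coverage estimate: {coverage:.1f}x")

# Toy sequence only; the real value (44.8%) comes from the full assembly.
print(f"toy G+C fraction: {gc_content('GGCCAATT'):.2f}")
```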

The replication origin was determined by colocalization of 
genes (dnaA, dnaN, recF, and gyrA) often found near the origin in 
prokaryotic genomes and G+C nucleotide skew [(G-C)/(G+C)]
analysis (8). Completeness of the genome was assessed using the
PhyloSift software (9), which searches for 40 highly conserved,
single-copy marker genes (10). Thirty-nine of these 40 markers
were found in this assembly and the missing marker (encoding 
porphobilinogen deaminase) was only found in 80% of the orig- 
inal 1,000 genomes used to generate the markers. 
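
The skew analysis mentioned above can be illustrated with a minimal sketch. It computes (G-C)/(G+C) in non-overlapping windows and reports where the cumulative skew reaches its minimum, which in many bacterial chromosomes lies near the replication origin. This is a generic heuristic shown on a toy sequence, not the procedure used for this genome; the window size and function names are assumptions chosen for illustration.

```python
def gc_skew_windows(seq: str, window: int = 1000) -> list[float]:
    """Return (G-C)/(G+C) for each non-overlapping window of the sequence."""
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        skews.append((g - c) / (g + c) if (g + c) else 0.0)
    return skews


def cumulative_skew_minimum(seq: str, window: int = 1000) -> int:
    """Return the end coordinate (0-based, exclusive) of the window at which
    the running sum of windowed skew is lowest."""
    running, best_value, best_end = 0.0, float("inf"), 0
    for i, skew in enumerate(gc_skew_windows(seq, window)):
        running += skew
        if running < best_value:
            best_value, best_end = running, (i + 1) * window
    return best_end


# Toy example: a C-rich half followed by a G-rich half, so the skew flips sign midway.
toy = "ACCC" * 25 + "AGGG" * 25
print(cumulative_skew_minimum(toy, window=20))
```

On a real circular chromosome the sequence start is arbitrary, so the candidate position would still need to be checked against the location of genes such as dnaA, as described in the text.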

An initial set of open reading frames likely to encode proteins 
(coding sequences [CDSs]) was predicted as previously described
(7). All predicted proteins larger than 30 amino acids were 
searched against a nonredundant protein database as previously 
described (7). Protein membrane-spanning domains were identified
by TopPred (11). The 5' regions of the CDSs were inspected
to define initiation codons using similarity searches and to identify
positions of ribosomal binding sites and transcriptional terminators.
Two sets of hidden Markov models were used to determine
CDS membership in families and superfamilies: Pfam v11.0
(12) and TIGRFAMs 3.0 (13). Pfam v11.0 hidden Markov models
were also used with a constraint of a minimum of two hits to find 
repeated domains within proteins and mask them. This annotation
was submitted with the genome in 2008, but in 2014 we requested
an in-place update of the annotation from NCBI, using
their integrated PGAP pipeline (14).
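
To make the length filter in the annotation step concrete, the following is a deliberately naive sketch of open reading frame detection: it scans all six frames for ATG-to-stop stretches and keeps only those encoding more than 30 amino acids. It is not the gene finder used for this genome, which is described in reference 7; the function names, the ATG-only start rule, and the toy sequence are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_aa: int = 30) -> list[tuple[str, int, int, int, int]]:
    """Return (strand, frame, start, end, length_aa) for each ATG..stop ORF
    encoding more than min_aa amino acids (stop codon not counted).
    Coordinates are 0-based, end-exclusive, and relative to the scanned strand."""
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    n_aa = (i - start) // 3
                    if n_aa > min_aa:
                        orfs.append((strand, frame, start, i + 3, n_aa))
                    start = None  # look for the next ORF in this frame
    return orfs


# Toy sequence: Met followed by 40 Ala codons and a stop (41 amino acids).
toy_gene = "ATG" + "GCT" * 40 + "TAA"
print(find_orfs(toy_gene))
```

Real prokaryotic gene prediction also considers alternative start codons, ribosomal binding sites, and codon usage statistics, which is why the text describes a separate inspection of the 5' regions.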

Nucleotide sequence accession numbers. This genome se- 
quence has been deposited at DDBJ/EMBL/GenBank under the 
accession no. CP001145. The version described in this paper is
version CP001145.1.